Abstract

The problem of estimation of density functionals like entropy and mutual information
has received much attention in the statistics and information theory communities.
A large class of estimators of functionals of the probability density suffer from the curse
of dimensionality, wherein the mean squared error (MSE) decays increasingly slowly as a
function of the sample size $T$ as the dimension $d$ of the samples increases. In particular,
the rate is often glacially slow, of order $O(T^{-\gamma/d})$, where $\gamma > 0$ is a rate parameter.
Examples of such estimators include kernel density estimators, $k$-nearest neighbor ($k$-NN)
density estimators, $k$-NN entropy estimators, intrinsic dimension estimators and other
examples. In this paper, we propose a weighted affine combination of an ensemble of
such estimators, where optimal weights can be chosen such that the weighted estimator
converges at a much faster dimension invariant rate of $O(T^{-1})$. Furthermore, we show
that these optimal weights can be determined by solving a convex optimization problem
which can be performed offline and does not require training data. We illustrate the
superior performance of our weighted estimator for two important applications: (i) estimating
the Panter-Dite distortion-rate factor and (ii) estimating the Shannon entropy
for testing the probability distribution of a random sample.



1 Introduction 

Non-linear functionals of probability densities $f$ of the form $G(f) = \int g(f(x),x) f(x)\,dx$ arise
in applications of information theory, machine learning, signal processing and statistical
estimation. Important examples of such functionals include Shannon ($g(f,x) = -\log(f)$)
and Rényi ($g(f,x) = f^{\alpha-1}$) entropy, and the quadratic functional $g(f,x) = f^2$. In these
applications, the functional of interest often must be estimated empirically from sample
realizations of the underlying densities.

Functional estimation has received significant attention in the mathematical statistics
community. However, estimators of functionals of multivariate probability densities $f$ suffer
from mean square error (MSE) rates which typically decrease with dimension $d$ of the sample
as $O(T^{-\gamma/d})$, where $T$ is the number of samples and $\gamma$ is a positive rate parameter.
Examples of such estimators include kernel density estimators [18], $k$-nearest neighbor ($k$-NN)
density estimators [4], $k$-NN entropy functional estimators [2, 17, 13], intrinsic dimension
estimators [17], divergence estimators [12], and mutual information estimators. This slow
convergence is due to the curse of dimensionality. In this paper, we introduce a simple affine
combination of an ensemble of such slowly convergent estimators and show that the weights
in this combination can be chosen to significantly improve the rate of MSE convergence of the
weighted estimator. In fact, our ensemble averaging method can improve MSE convergence
to the parametric rate $O(T^{-1})$.

Specifically, for $d$-dimensional data, it has been observed that the variance of estimators
of the functional $G(f)$ decays as $O(T^{-1})$ while the bias decays as $O(T^{-1/(1+d)})$. To accelerate
the slow rate of convergence of the bias in high dimensions, we propose a weighted ensemble
estimator for ensembles of estimators that satisfy conditions $\mathscr{C}$.1 (2.1) and $\mathscr{C}$.2 (2.2) defined
in Section 2 below. Optimal weights, which serve to lower the bias of the ensemble estimator
to $O(T^{-1/2})$, can be determined by solving a convex optimization problem. Remarkably, this
optimization problem does not involve any density-dependent parameters and can therefore
be performed offline. This then ensures MSE convergence of the weighted estimator at the
parametric rate of $O(T^{-1})$.

1.1 Related work 

When the density $f$ is $s > d/4$ times differentiable, certain estimators of functionals of the
form $\int g(f(x),x) f(x)\,dx$, proposed by Birgé and Massart [2], Laurent [11] and Giné and
Mason [3], can achieve the parametric MSE convergence rate of $O(T^{-1})$. The key ideas in
[2, 11, 3] are: (i) estimation of quadratic functionals $\int f^2(x)\,dx$ with MSE convergence rate
$O(T^{-1})$; (ii) use of kernel density estimators with kernels that satisfy the following symmetry
constraints:

$$\int K(x)\,dx = 1, \qquad \int x^r K(x)\,dx = 0, \qquad (1.1)$$

for $r = 1, \dots, s$; and finally (iii) truncating the kernel density estimate so that it is bounded
away from 0. By using these ideas, the estimators proposed by [2, 11, 3] are able to achieve
parametric convergence rates.

In contrast, the estimators proposed in this paper require additional higher order smoothness
conditions on the density, i.e. the density must be $s > d$ times differentiable. However,
our estimators are much simpler to implement than the estimators proposed in
[2, 11, 3]. In particular, the estimators in [2, 11, 3] require separately estimating quadratic
functionals of the form $\int f^2(x)\,dx$, and using truncated kernel density estimators with symmetric
kernels (1.1), conditions that are not required in this paper. Our estimator is a simple
affine combination of an ensemble of estimators, where the ensemble satisfies conditions $\mathscr{C}$.1
and $\mathscr{C}$.2. Such an ensemble can be trivial to implement. For instance, in this paper we show
that simple uniform kernel plug-in estimators (3.3) satisfy conditions $\mathscr{C}$.1 and $\mathscr{C}$.2.



Ensemble based methods have been previously proposed in the context of classification.
For example, in both boosting [16] and multiple kernel learning [10] algorithms, lower complexity
weak learners are combined to produce classifiers with higher accuracy. Our work
differs from these methods in several ways. First and foremost, our proposed method performs
estimation rather than classification. An important consequence of this is that the weights
we use are data independent, while the weights in boosting and multiple kernel learning must
be estimated from training data since they depend on the unknown distribution.

1.2 Organization 

The remainder of the paper is organized as follows. We formally describe the weighted
ensemble estimator for a general ensemble of estimators in Section 2 and specify conditions
$\mathscr{C}$.1 and $\mathscr{C}$.2 on the ensemble that ensure that the ensemble estimator has a faster rate of MSE
convergence. Under the assumption that conditions $\mathscr{C}$.1 and $\mathscr{C}$.2 are satisfied, we provide
an MSE optimal set of weights as the solution to a convex optimization (2.3). Next, we shift
the focus to entropy estimation in Section 3, propose an ensemble of simple uniform kernel
plug-in entropy estimators, and show that this ensemble satisfies conditions $\mathscr{C}$.1 and $\mathscr{C}$.2.
Subsequently, we apply the ensemble estimator theory in Section 2 to the problem of entropy
estimation using this ensemble of kernel plug-in estimators. We present simulation results
in Section 4 that illustrate the superior performance of this ensemble entropy estimator in
the context of (i) estimation of the Panter-Dite distortion-rate factor [6] and (ii) testing the
probability distribution of a random sample. We conclude the paper in Section 5.

Notation 

We will use bold face type to indicate random variables and random vectors and regular type
face for constants. We denote the statistical expectation operator by the symbol $\mathbb{E}$ and the
conditional expectation given random variable $\mathbf{Z}$ using the notation $\mathbb{E}_{\mathbf{Z}}$. We also define the
variance operator as $\mathbb{V}[\mathbf{X}] = \mathbb{E}[(\mathbf{X} - \mathbb{E}[\mathbf{X}])^2]$ and the covariance operator as
$\mathrm{Cov}[\mathbf{X}, \mathbf{Y}] = \mathbb{E}[(\mathbf{X} - \mathbb{E}[\mathbf{X}])(\mathbf{Y} - \mathbb{E}[\mathbf{Y}])]$. We denote the bias of an estimator by $\mathbb{B}$.

2 Ensemble estimators 

Let $\bar{l} = \{l_1, \dots, l_L\}$ denote a set of parameter values. For a parameterized ensemble of
estimators $\{\hat{\mathbf{E}}_l\}_{l \in \bar{l}}$ of $E$, define the weighted ensemble estimator with respect to weights
$w = \{w(l_1), \dots, w(l_L)\}$ as

$$\hat{\mathbf{E}}_w = \sum_{l \in \bar{l}} w(l)\, \hat{\mathbf{E}}_l,$$

where the weights satisfy $\sum_{l \in \bar{l}} w(l) = 1$. This latter sum-to-one condition guarantees that $\hat{\mathbf{E}}_w$
is asymptotically unbiased if the component estimators $\{\hat{\mathbf{E}}_l\}_{l \in \bar{l}}$ are asymptotically unbiased.
Let this ensemble of estimators $\{\hat{\mathbf{E}}_l\}_{l \in \bar{l}}$ satisfy the following two conditions:



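• $\mathscr{C}$.1 The bias is given by

$$\mathbb{B}(\hat{\mathbf{E}}_l) = \sum_{i \in J} c_i\, \psi_i(l)\, T^{-i/2d} + O(1/\sqrt{T}), \qquad (2.1)$$

where $c_i$ are constants that depend on the underlying density, $J = \{i_1, \dots, i_I\}$ is a finite
index set with cardinality $I < L$, $\min(J) = i_0 > 0$ and $\max(J) = i_d \le d$, and $\psi_i(l)$ are
basis functions that depend only on the estimator parameter $l$.

• $\mathscr{C}$.2 The variance is given by

$$\mathbb{V}(\hat{\mathbf{E}}_l) = c_v \left(\frac{1}{T}\right) + o\!\left(\frac{1}{T}\right). \qquad (2.2)$$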

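Theorem 1. For an ensemble of estimators $\{\hat{\mathbf{E}}_l\}_{l \in \bar{l}}$, assume that the conditions $\mathscr{C}$.1 and $\mathscr{C}$.2
hold. Then, there exists a weight vector $w_0$ such that

$$\mathbb{E}[(\hat{\mathbf{E}}_{w_0} - E)^2] = O(1/T).$$

This weight vector can be found by solving the following convex optimization problem:

$$\begin{aligned} \underset{w}{\text{minimize}} \quad & \|w\|_2 \\ \text{subject to} \quad & \sum_{l \in \bar{l}} w(l) = 1, \\ & \gamma_w(i) = \sum_{l \in \bar{l}} w(l)\, \psi_i(l) = 0, \quad i \in J, \end{aligned} \qquad (2.3)$$

where $\psi_i(l)$ is the basis defined in (2.1).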
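Proof. The bias of the ensemble estimator is given by

$$\mathbb{B}(\hat{\mathbf{E}}_w) = \sum_{i \in J} c_i\, \gamma_w(i)\, T^{-i/2d} + O\!\left(\frac{\sqrt{L}\, \|w\|_2}{\sqrt{T}}\right). \qquad (2.4)$$

Denote the covariance matrix of $\{\hat{\mathbf{E}}_l;\ l \in \bar{l}\}$ by $\Sigma_L$. Let $\bar{\Sigma}_L = \Sigma_L T$. Observe that by (2.2)
and the Cauchy-Schwarz inequality, the entries of $\bar{\Sigma}_L$ are $O(1)$. The variance of the weighted
estimator $\hat{\mathbf{E}}_w$ can then be bounded as follows:

$$\mathbb{V}(\hat{\mathbf{E}}_w) = \mathbb{V}\Big(\sum_{l \in \bar{l}} w_l \hat{\mathbf{E}}_l\Big) = w' \Sigma_L w = \frac{w' \bar{\Sigma}_L w}{T} \le \frac{\lambda_{\max}(\bar{\Sigma}_L)\, \|w\|_2^2}{T} \le \frac{\mathrm{trace}(\bar{\Sigma}_L)\, \|w\|_2^2}{T} = O\!\left(\frac{L\, \|w\|_2^2}{T}\right). \qquad (2.5)$$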

We seek a weight vector $w$ that (i) ensures that the bias of the weighted estimator is
$O(T^{-1/2})$ and (ii) has low $\ell_2$ norm $\|w\|_2$ in order to limit the contribution of the variance
and the higher order bias terms of the weighted estimator. To this end, let $w_0$ be the solution
to the convex optimization problem defined in (2.3). Equivalently, $w_0$ is the solution of

$$\underset{w}{\text{minimize}} \quad \|w\|_2^2 \qquad \text{subject to} \quad A_0 w = b,$$

where $A_0$ and $b$ are defined below. Let $a_0$ be the vector of ones, $[1, 1, \dots, 1]_{1 \times L}$, and let $a_i$, for
each $i \in J$, be given by $a_i = [\psi_i(l_1), \dots, \psi_i(l_L)]$. Define $A_0 = [a_0', a_{i_1}', \dots, a_{i_I}']'$,
$A_1 = [a_{i_1}', \dots, a_{i_I}']'$ and $b = [1; 0; 0; \dots; 0]_{(I+1) \times 1}$.

Since $L > I$, the system of equations $A_0 w = b$ is guaranteed to have at least one solution
(assuming linear independence of the rows $a_i$). The minimum squared norm $\eta_L(d) := \|w_0\|_2^2$
is then given by

$$\eta_L(d) = \frac{\det(A_1 A_1')}{\det(A_0 A_0')}.$$

Consequently, by (2.4), the bias $\mathbb{B}[\hat{\mathbf{E}}_{w_0}] = O(\sqrt{L\, \eta_L(d)}/\sqrt{T})$. By (2.5), the estimator
variance $\mathbb{V}[\hat{\mathbf{E}}_{w_0}] = O(L\, \eta_L(d)/T)$. The overall MSE is therefore also of order $O(L\, \eta_L(d)/T)$.

For any fixed dimension $d$ and fixed number of estimators $L > I$ in the ensemble, both independent
of the sample size $T$, the value of $\eta_L(d)$ is also independent of $T$. Stated mathematically,
$L\, \eta_L(d) = O(1)$ for any fixed dimension $d$ and fixed number of estimators $L > I$ independent
of sample size $T$. This concludes the proof. □
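The weight computation in the proof can be sketched numerically. The snippet below is only an illustration under stated assumptions: the function name `min_norm_weights`, the dimension `d = 3`, the parameter grid `l` and the basis $\psi_i(l) = l^{i/d}$ are hypothetical choices made for the example, not values prescribed by Theorem 1. It forms $A_0$, solves for the minimum-norm $w_0$ with $A_0 w_0 = b$, and compares $\|w_0\|_2^2$ with the determinant ratio $\det(A_1 A_1')/\det(A_0 A_0')$.

```python
import numpy as np

def min_norm_weights(psi):
    """Minimum-norm weights for (2.3).

    psi: array of shape (I, L) with psi[j, m] = psi_{i_j}(l_m).
    Returns (w0, A0) where A0 stacks the all-ones row on top of psi.
    """
    n_basis, n_est = psi.shape
    A0 = np.vstack([np.ones((1, n_est)), psi])  # first row enforces sum(w) = 1
    b = np.zeros(n_basis + 1)
    b[0] = 1.0
    # Minimum-norm solution of the underdetermined system A0 w = b:
    # w0 = A0' (A0 A0')^{-1} b, valid when the rows of A0 are independent.
    w0 = A0.T @ np.linalg.solve(A0 @ A0.T, b)
    return w0, A0

# Hypothetical example: d = 3, J = {1, 2}, L = 6 estimators.
d = 3
l = np.linspace(0.5, 3.0, 6)
psi = np.array([l ** (i / d) for i in (1, 2)])

w0, A0 = min_norm_weights(psi)
A1 = A0[1:]                                    # A0 without the all-ones row
eta = float(w0 @ w0)                           # eta_L(d) = ||w0||_2^2
eta_det = np.linalg.det(A1 @ A1.T) / np.linalg.det(A0 @ A0.T)
print(eta, eta_det)
```

The two printed values should agree up to floating point error, since $\|w_0\|_2^2 = b'(A_0 A_0')^{-1} b$ is the leading entry of $(A_0 A_0')^{-1}$.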



In the next section, we will verify conditions $\mathscr{C}$.1 (2.1) and $\mathscr{C}$.2 (2.2) for plug-in estimators
$\tilde{\mathbf{G}}_k(f)$ of entropy-like functionals $G(f) = \int g(f(x),x) f(x)\,dx$.

3 Application to estimation of functionals of a density 

Our focus is the estimation of general non-linear functionals $G(f)$ of $d$-dimensional multivariate
densities $f$ with known finite support $\mathcal{S} = [a, b]^d$, where $G(f)$ has the form

$$G(f) = \int g(f(x),x) f(x)\,dx, \qquad (3.1)$$

for some smooth function $g(f, x)$. Let $\partial\mathcal{S}$ denote the boundary of $\mathcal{S}$. Assume that $T = N + M$
i.i.d. realizations $\{\mathbf{X}_1, \dots, \mathbf{X}_N, \mathbf{X}_{N+1}, \dots, \mathbf{X}_{N+M}\}$ are available from the density $f$.

3.1 Plug-in estimators of entropy 

The truncated uniform kernel density estimator is defined below. For any positive real
number $k \le M$, define the distance $d_k$ to be $d_k = (k/M)^{1/d}$. Define the truncated kernel
region for each $X \in \mathcal{S}$ to be $S_k(X) = \{Y \in \mathcal{S} : \|X - Y\|_\infty \le d_k/2\}$, and the volume of
the truncated uniform kernel to be $V_k(X) = \int_{S_k(X)} dz$. Note that when the smallest distance
from $X$ to $\partial\mathcal{S}$ is greater than $d_k/2$, $V_k(X) = d_k^d = k/M$. Let $\mathbf{l}_k(X)$ denote the number of
samples falling in $S_k(X)$: $\mathbf{l}_k(X) = \sum_{i=1}^{M} 1_{\{\mathbf{X}_i \in S_k(X)\}}$. The truncated uniform kernel density
estimator is defined as

$$\tilde{\mathbf{f}}_k(X) = \frac{\mathbf{l}_k(X)}{M V_k(X)}. \qquad (3.2)$$

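The plug-in estimator of the density functional is constructed using a data splitting
approach as follows. The data is randomly subdivided into two parts $\{\mathbf{X}_1, \dots, \mathbf{X}_N\}$ and
$\{\mathbf{X}_{N+1}, \dots, \mathbf{X}_{N+M}\}$ of $N$ and $M$ points respectively. In the first stage, we form the kernel
density estimate $\tilde{\mathbf{f}}_k$ at the $N$ points $\{\mathbf{X}_1, \dots, \mathbf{X}_N\}$ using the $M$ realizations $\{\mathbf{X}_{N+1}, \dots, \mathbf{X}_{N+M}\}$.
Subsequently, we use the $N$ samples $\{\mathbf{X}_1, \dots, \mathbf{X}_N\}$ to approximate the functional $G(f)$ and
obtain the plug-in estimator:

$$\tilde{\mathbf{G}}_k = \frac{1}{N} \sum_{i=1}^{N} g(\tilde{\mathbf{f}}_k(\mathbf{X}_i), \mathbf{X}_i). \qquad (3.3)$$

Also define a standard kernel density estimator $\hat{\mathbf{f}}_k$, which is identical to $\tilde{\mathbf{f}}_k$ except that the
volume $V_k(X)$ is always set to the untruncated value $V_k(X) = k/M$. Define

$$\hat{\mathbf{G}}_k = \frac{1}{N} \sum_{i=1}^{N} g(\hat{\mathbf{f}}_k(\mathbf{X}_i), \mathbf{X}_i). \qquad (3.4)$$

The estimator $\hat{\mathbf{G}}_k$ is identical to the estimator of Györfi and van der Meulen [8]. Observe
that the implementation of $\hat{\mathbf{G}}_k$, unlike $\tilde{\mathbf{G}}_k$, does not require knowledge about the support of
the density.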
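A minimal sketch of the two plug-in estimators (3.3) and (3.4) follows. It assumes a hypercube support $[lo, hi]^d$ and an equal split $N = M = T/2$; the function names `uniform_kernel_plugin` and `shannon_g`, the `seed` argument and the test data (uniform samples on the unit square, whose Shannon entropy is 0) are illustrative choices, not part of the original text.

```python
import numpy as np

def uniform_kernel_plugin(x, g, k, lo=0.0, hi=1.0, truncate=True, seed=0):
    """Data-split uniform kernel plug-in estimate of G(f) on support [lo, hi]^d.

    truncate=True uses the clipped kernel volume V_k(X) as in (3.2)-(3.3);
    truncate=False uses the untruncated volume k/M as in (3.4).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    T, d = x.shape
    perm = rng.permutation(T)
    n = T // 2                                   # assumed split: N = M = T/2
    xn, xm = x[perm[:n]], x[perm[n:]]            # N evaluation points, M density points
    M = xm.shape[0]
    half = 0.5 * (k / M) ** (1.0 / d)            # d_k / 2
    # l_k(X_i): number of the M points within sup-norm distance d_k/2 of X_i
    dist = np.abs(xn[:, None, :] - xm[None, :, :]).max(axis=2)
    counts = (dist <= half).sum(axis=1)
    if truncate:
        # volume of the kernel cube after clipping it to the support
        side = np.minimum(xn + half, hi) - np.maximum(xn - half, lo)
        vol = side.prod(axis=1)
    else:
        vol = np.full(n, k / M)
    f_est = counts / (M * vol)
    return float(np.mean(g(f_est, xn)))

def shannon_g(f, x):
    # -log(f) where f > 0 and 1 where f == 0, matching the Shannon functional
    safe = np.where(f > 0, f, 1.0)
    return np.where(f > 0, -np.log(safe), 1.0)

rng = np.random.default_rng(1)
samples = rng.uniform(size=(2000, 2))            # uniform on [0, 1]^2, true entropy 0
est_trunc = uniform_kernel_plugin(samples, shannon_g, k=50, truncate=True)
est_std = uniform_kernel_plugin(samples, shannon_g, k=50, truncate=False)
print(est_trunc, est_std)
```

In this toy setting the untruncated estimate is pulled upward near the boundary, where the kernel cube extends outside the support and the density is underestimated; the truncated version compensates by shrinking the volume.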

3.1.1 Assumptions 

We make a number of technical assumptions that will allow us to obtain tight MSE convergence
rates for the kernel density estimators defined above.
($\mathscr{A}$.0): Assume that $k = k_0 M^\beta$ for some rate constant $0 < \beta < 1$, and assume that $M$, $N$ and $T$
are linearly related through the proportionality constant $\alpha_{frac}$ with $0 < \alpha_{frac} < 1$,
$M = \alpha_{frac} T$ and $N = (1 - \alpha_{frac}) T$.
($\mathscr{A}$.1): Let the density $f$ be uniformly bounded away from 0 and upper bounded on the
set $\mathcal{S}$, i.e., there exist constants $\epsilon_0$, $\epsilon_\infty$ such that $0 < \epsilon_0 \le f(x) \le \epsilon_\infty < \infty$ for all $x \in \mathcal{S}$.
($\mathscr{A}$.2): Assume that the density $f$ has continuous partial derivatives of order $d$ in the interior of
the set $\mathcal{S}$, and that these derivatives are upper bounded.
($\mathscr{A}$.3): Assume that the function $g(f,x)$ has $\max\{\lambda, d\}$ partial derivatives w.r.t. the
argument $f$, where $\lambda$ satisfies the condition $\lambda\beta > 1$. Denote the $n$-th partial derivative
of $g(f,x)$ w.r.t. $f$ by $g^{(n)}(f,x)$.
($\mathscr{A}$.4): Assume that the absolute value of the functional $g(f,x)$ and its partial derivatives
are strictly upper bounded in the range $\epsilon_0 \le f \le \epsilon_\infty$ for all $x$.
($\mathscr{A}$.5): Let $\epsilon \in (0, 1)$ and $\delta \in (2/3, 1)$. Let $\mathcal{C}(M)$ be a positive function satisfying the
condition $\mathcal{C}(M) = \Theta(\exp(-M^{\beta(1-\delta)}))$. For some fixed $0 < \epsilon < 1$, define
$p_l = (1 - \epsilon)\epsilon_0$ and $p_u = (1 + \epsilon)\epsilon_\infty$. Assume that the conditions

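(i) $\sup_{x} |h(0, x)| < G_1 < \infty$,

(ii) $\sup_{f \in (p_l, p_u),\, x} |h(f,x)| < G_2 < \infty$,

(iii) $\sup_{f \in (1/k,\, p_u),\, x} |h(f,x)|\, \mathcal{C}(M) < G_3 < \infty \quad \forall M$,

(iv) $\sup_{f \in (p_u,\, 2^d M/k),\, x} |h(f,x)|\, \mathcal{C}(M) < G_4 < \infty \quad \forall M$,

are satisfied by $h(f,x) = g(f,x)$, $g^{(3)}(f,x)$ and $g^{(\lambda)}(f,x)$, for some constants $G_1$, $G_2$, $G_3$ and
$G_4$.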

These assumptions are comparable to other rigorous treatments of entropy estimation.
The assumption ($\mathscr{A}$.0) is equivalent to choosing the bandwidth of the kernel to be a fractional
power of the sample size [15]. The rest of the above assumptions can be divided into
two categories: (i) assumptions on the density $f$, and (ii) assumptions on the functional $g$.
The assumptions on the smoothness and boundedness away from 0 and $\infty$ of the density $f$ are
similar to the assumptions made by other estimators of entropy as listed in Section II, pQ.
The assumptions on the functional $g$ ensure that $g$ is sufficiently smooth and that the estimator
is bounded. These assumptions on the functional are readily satisfied by the common
functionals that are of interest in the literature: Shannon $g(f,x) = -\log(f)\, I(f > 0) + I(f = 0)$
and Rényi $g(f,x) = f^{\alpha-1} I(f > 0) + I(f = 0)$ entropy, where $I(\cdot)$ is the indicator function,
and the quadratic functional $g(f,x) = f^2$.

3.1.2 Analysis of MSE 

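Under the assumptions stated above, we have shown the following in the Appendix:

Theorem 2. The biases of the plug-in estimators $\tilde{\mathbf{G}}_k$, $\hat{\mathbf{G}}_k$ are given by

where $c_{1,i}$, $c_2$ and $c_3$ are constants that depend on $g$ and $f$.

Theorem 3. The variances of the plug-in estimators $\hat{\mathbf{G}}_k$, $\tilde{\mathbf{G}}_k$ are identical up to leading
terms, and are given by

$$\mathbb{V}(\hat{\mathbf{G}}_k) = c_4 \left(\frac{1}{N}\right) + c_5 \left(\frac{1}{M}\right) + o\!\left(\frac{1}{M} + \frac{1}{N}\right),$$

$$\mathbb{V}(\tilde{\mathbf{G}}_k) = c_4 \left(\frac{1}{N}\right) + c_5 \left(\frac{1}{M}\right) + o\!\left(\frac{1}{M} + \frac{1}{N}\right),$$

where $c_4$ and $c_5$ are constants that depend on $g$ and $f$.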
3.1.3 Optimal MSE rate

From Theorem 2, observe that the conditions $k \to \infty$ and $k/M \to 0$ are necessary for the
estimators $\hat{\mathbf{G}}_k$ and $\tilde{\mathbf{G}}_k$ to be asymptotically unbiased. Likewise, from Theorem 3, the conditions
$N \to \infty$ and $M \to \infty$ are necessary for the variance of the estimator to converge to 0. Below, we
optimize the choice of bandwidth $k$ for minimum MSE, and also show that the optimal MSE
rate is invariant to the choice of $\alpha_{frac}$.

Optimal choice of $k$. Minimizing the MSE over $k$ is equivalent to minimizing the square
of the bias over $k$. The optimal choice of $k$ is given by

$$k_{opt} = \Theta(M^{1/(1+d)}), \qquad (3.5)$$

and the bias evaluated at $k_{opt}$ is $\Theta(M^{-1/(1+d)})$.
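The rate in (3.5) can be motivated by a short balancing argument. The sketch below assumes that the bias is dominated by two competing terms, one growing with the bandwidth as $(k/M)^{1/d}$ and one decaying as $1/k$, with generic constants written here as $c_1$ and $c_2$. This leading-order form is an assumption made for illustration; it is consistent with the expansion (A.3) and with the requirement that $k \to \infty$ and $k/M \to 0$.

```latex
\begin{align*}
b(k) &\approx c_1 \left(\frac{k}{M}\right)^{1/d} + \frac{c_2}{k}, \\
\left(\frac{k}{M}\right)^{1/d} \asymp \frac{1}{k}
  &\iff k^{1 + 1/d} \asymp M^{1/d}
  \iff k \asymp M^{1/(1+d)}, \\
b(k_{opt}) &\asymp \frac{1}{k_{opt}} \asymp M^{-1/(1+d)} .
\end{align*}
```

Squaring the last line gives a squared bias of order $M^{-2/(1+d)}$, which decays more slowly than $1/M$ for every $d \ge 1$.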

Choice of $\alpha_{frac}$. Observe that the MSE of $\hat{\mathbf{G}}_k$ and $\tilde{\mathbf{G}}_k$ are dominated by the squared
bias ($\Theta(M^{-2/(1+d)})$) as contrasted to the variance ($\Theta(1/N + 1/M)$). This implies that the
asymptotic MSE rate of convergence is invariant to the selected proportionality constant
$\alpha_{frac}$.

In view of the two observations above, the optimal MSE for the estimators $\hat{\mathbf{G}}_k$ and $\tilde{\mathbf{G}}_k$ is therefore
achieved for the choice of $k = \Theta(M^{1/(1+d)})$, and is given by $\Theta(T^{-2/(1+d)})$. Our goal is
to reduce the estimator MSE to $O(T^{-1})$. We do so by applying the method of weighted
ensembles described in Section 2.

3.2 Weighted ensemble entropy estimator 

For a positive integer $L > I = d - 1$, choose $\bar{l} = \{l_1, \dots, l_L\}$ to be positive real numbers.
Define the mapping $k(l) = l\sqrt{M}$ and let $\bar{k} = \{k(l);\ l \in \bar{l}\}$. Define the weighted ensemble
estimator

$$\tilde{\mathbf{G}}_w = \sum_{l \in \bar{l}} w(l)\, \tilde{\mathbf{G}}_{k(l)}. \qquad (3.6)$$

From Theorems 2 and 3, we see that the biases of the ensemble of estimators $\{\tilde{\mathbf{G}}_{k(l)};\ l \in \bar{l}\}$
satisfy $\mathscr{C}$.1 (2.1) when we set $\psi_i(l) = l^{i/d}$ and $J = \{1, \dots, d-1\}$. Furthermore, the general form
of the variance of $\tilde{\mathbf{G}}_{k(l)}$ follows $\mathscr{C}$.2 (2.2) because $N, M = \Theta(T)$. This implies that we can use
the weighted ensemble estimator $\tilde{\mathbf{G}}_w$ to estimate entropy at an $O(L\, \eta_L(d)/T)$ convergence rate
by setting $w$ equal to the optimal weight $w_0$ given by (2.3).
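To see the cancellation at work, the sketch below builds a synthetic bias of the form in (2.1) with $\psi_i(l) = l^{i/d}$ and $J = \{1, \dots, d-1\}$, then applies the minimum-norm weights. The values of `d`, `L`, `T`, the grid `l` and the constants `c` are hypothetical and chosen only for illustration; the $O(1/\sqrt{T})$ remainder in (2.1) is omitted, so the example shows only that the leading terms are annihilated.

```python
import numpy as np

d, L, T = 4, 10, 10_000
l = np.linspace(0.5, 3.0, L)                    # hypothetical parameter grid
orders = np.arange(1, d)                        # J = {1, ..., d-1}
psi = l[None, :] ** (orders[:, None] / d)       # psi_i(l) = l**(i/d), shape (d-1, L)

# Minimum-norm weights with sum(w) = 1 and sum_l w(l) psi_i(l) = 0 for i in J.
A0 = np.vstack([np.ones(L), psi])
b = np.zeros(d)
b[0] = 1.0
w0 = np.linalg.lstsq(A0, b, rcond=None)[0]      # min-norm solution of A0 w = b

# Synthetic leading-order bias of each estimator in the ensemble.
c = np.array([1.0, -0.5, 0.25])                 # hypothetical constants c_i
rates = T ** (-orders[:, None] / (2 * d))       # T**(-i/2d)
bias_each = (c[:, None] * psi * rates).sum(axis=0)

bias_weighted = float(w0 @ bias_each)           # equals sum_i c_i gamma_w(i) T**(-i/2d)
print(bias_each.min(), bias_weighted)
```

Each individual bias is bounded away from zero, while the weighted combination vanishes up to floating point error, whatever the values of the constants $c_i$.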



4 Experiments 

We illustrate the superior performance of the proposed weighted ensemble estimator for two 
applications: (i) estimation of the Panter-Dite rate distortion factor, and (ii) estimation of 
entropy to test for randomness of a random sample. 



For finite $T$, direct use of Theorem 1 can lead to excessively high variance. This is because
forcing the condition in (2.3) that $\gamma_w(i) = 0$ is too strong and, in fact, not necessary. The careful
reader may notice that to obtain the $O(T^{-1})$ MSE convergence rate in Theorem 1 it is sufficient
that $\gamma_w(i)$ be of order $O(T^{-1/2 + i/2d})$. Therefore, in practice we determine the optimal weights
according to the optimization:

$$\begin{aligned} \min_{w} \quad & \epsilon \\ \text{subject to} \quad & \gamma_w(0) = 1, \\ & |\gamma_w(i)\, T^{1/2 - i/2d}| \le \epsilon, \quad i \in J, \\ & \|w\|_2^2 \le \eta. \end{aligned} \qquad (4.1)$$



The optimization (4.1) is also convex. Note that, as contrasted to (2.3), the norm of the
weight vector $w$ is bounded instead of being minimized. By relaxing the constraints $\gamma_w(i) = 0$
in (2.3) to the softer constraints in (4.1), the upper bound $\eta$ on $\|w\|_2^2$ can be reduced from
the value $\eta_L(d)$ obtained by solving (2.3). This results in a more favorable trade-off between
bias and variance for moderate sample sizes. In our experiments, we find that setting $\eta = 3d$
yields good MSE performance. Note that as $T \to \infty$, we must have $\gamma_w(i) \to 0$ for $i \in J$ in
order to keep $\epsilon$ finite, thus recovering the strict constraints in (2.3).



For fixed sample size $T$ and dimension $d$, observe that increasing $L$ increases the number
of degrees of freedom in the convex problem (4.1), and therefore will result in a smaller value
of $\epsilon$ and in turn improved estimator performance. In our simulations, we choose $\bar{l}$ to be
$L = 50$ equally spaced values between 0.3 and 3, i.e. the $l_i$ are uniformly spaced as

$$l_i = \frac{x}{a} + \frac{(a-1)\, i\, x}{aL},$$

with scale and range parameters $a = 10$ and $x = 3$ respectively. We limit $L$ to 50 because
we find that the gains beyond $L = 50$ are negligible. This diminishing return
is a direct result of the increasing similarity among the entries in $\bar{l}$, which translates to
increasingly similar basis functions $\psi_i(l) = l^{i/d}$.
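The effect of relaxing the strict constraints can be explored with the following sketch. It builds the grid above (the index range $i = 1, \dots, L$ is an assumption) and then solves a penalized surrogate, not the min-max program (4.1) itself: it minimizes $\|Bw\|_2^2 + \mu \|w\|_2^2$ subject to $\sum_l w(l) = 1$, where the rows of $B$ are the scaled constraint functionals $T^{1/2 - i/2d}\psi_i(l)$. The penalty level `mu` and the choices `d = 6`, `T = 3000` are hypothetical. The surrogate has a closed form and illustrates the same trade-off: a small residual $\gamma_w(i)\,T^{1/2-i/2d}$ is tolerated in exchange for a smaller weight norm.

```python
import numpy as np

d, L, T = 6, 50, 3000
a, x = 10.0, 3.0
i = np.arange(1, L + 1)                         # assumed index range i = 1, ..., L
l = x / a + (a - 1.0) * i * x / (a * L)         # grid of kernel scale parameters

orders = np.arange(1, d)                        # J = {1, ..., d-1}
psi = l[None, :] ** (orders[:, None] / d)       # psi_i(l) = l**(i/d)
scale = T ** (0.5 - orders / (2.0 * d))         # T**(1/2 - i/2d)
B = scale[:, None] * psi                        # scaled constraint functionals

# Strict solution of (2.3): minimum-norm w with sum(w) = 1 and psi @ w = 0.
A0 = np.vstack([np.ones(L), psi])
b = np.zeros(d)
b[0] = 1.0
w_strict = np.linalg.lstsq(A0, b, rcond=None)[0]

# Penalized surrogate (a stand-in for (4.1), not the same program):
#   minimize ||B w||^2 + mu * ||w||^2   subject to sum(w) = 1.
mu = 1.0                                        # hypothetical penalty level
Q = B.T @ B + mu * np.eye(L)
q = np.linalg.solve(Q, np.ones(L))
w_soft = q / q.sum()                            # closed-form constrained minimizer

eps_soft = float(np.max(np.abs(B @ w_soft)))    # largest scaled residual |gamma_w(i) T**(1/2 - i/2d)|
norm_strict = float(w_strict @ w_strict)        # eta_L(d) from the strict problem
norm_soft = float(w_soft @ w_soft)
print(norm_strict, norm_soft, eps_soft)
```

Because the strict solution is feasible for the surrogate with zero residual, the surrogate weights can never have a larger $\ell_2$ norm than the strict ones; how much smaller depends on `mu`.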

4.1 Panter-Dite factor estimation 

For a $d$-dimensional source with underlying density $f$, the Panter-Dite distortion-rate function
[6] for a $q$-dimensional vector quantizer with $n$ levels of quantization is given by
$\delta(n) = n^{-2/q} \int f^{q/(q+2)}(x)\,dx$. The Panter-Dite factor corresponds to the functional $G(f)$ with
$g(f, x) = n^{-2/q} f^{-2/(q+2)} I(f > 0) + I(f = 0)$. The Panter-Dite factor is directly related to the Rényi
$\alpha$-entropy, for which several other estimators have been proposed [3 El El 112].

In our simulations we compare six different choices of functional estimators: the three
estimators previously introduced, (i) the standard kernel plug-in estimator $\hat{\mathbf{G}}_k$, (ii) the
boundary truncated plug-in estimator $\tilde{\mathbf{G}}_k$ and (iii) the weighted estimator $\tilde{\mathbf{G}}_w$ with optimal weight
$w = w^*$ given by (4.1); and in addition the following popular entropy estimators: (iv) histogram
plug-in estimator [7], (v) $k$-nearest neighbor ($k$-NN) entropy estimator [17] and (vi)
entropic $k$-NN graph estimator [3l [H]. For both $\hat{\mathbf{G}}_k$ and $\tilde{\mathbf{G}}_k$, we select the bandwidth
parameter $k$ as a function of $M$ according to the optimal proportionality $k = M^{1/(1+d)}$ and
$N = M = T/2$.

(a) Variation of MSE of Panter-Dite factor estimates as a function of sample size
$T$. From the figure, we see that the proposed weighted estimator has the fastest
MSE rate of convergence w.r.t. sample size $T$ ($d = 6$).

(b) Variation of MSE of Panter-Dite factor estimates as a function of dimension $d$.
From the figure, we see that the MSE of the proposed weighted estimator has the
slowest rate of growth with increasing dimension $d$ ($T = 3000$).

Figure 1: Variation of MSE of Panter-Dite factor estimates using standard kernel plug-in
estimator [14], truncated kernel plug-in estimator (3.3), histogram plug-in estimator [17],
$k$-NN estimator [20], entropic graph estimator [18] and the weighted ensemble estimator (3.6).

We choose $f$ to be the $d$-dimensional mixture density $f(a,b,p,d) = p f_\beta(a,b,d) + (1 - p) f_u(d)$,
where $d = 6$, $f_\beta(a, b, d)$ is a $d$-dimensional Beta density with parameters $a = 6$, $b = 6$,
$f_u(d)$ is a $d$-dimensional uniform density and the mixing ratio $p$ is 0.8. We
choose the beta-uniform mixture for our experiments because it trivially satisfies all the
assumptions on the density $f$ listed in Section 3.1, including the assumptions of finite support
and strict boundedness away from 0 on the support. The true value of the Panter-Dite factor
$\delta(n)$ for the beta-uniform mixture is calculated using numerical integration methods via the
'Mathematica' software (http://www.wolfram.com/mathematica/). Numerical integration
is used because evaluating the entropy in closed form for the beta-uniform mixture is not
tractable.

The MSE values for each of the six estimators are calculated by averaging the squared error
$[\hat{\delta}_i(n) - \delta(n)]^2$, $i = 1, \dots, m$, over $m = 1000$ Monte-Carlo trials, where each $\hat{\delta}_i(n)$ corresponds
to an independent instance of the estimator.

4.1.1 Variation of MSE with sample size T 



The MSE results of the different estimators are shown in Fig. 1(a) as a function of sample size
$T$, for fixed dimension $d = 6$. It is clear from the figure that the proposed ensemble estimator
$\tilde{\mathbf{G}}_w$ has a significantly faster rate of convergence, while the MSEs of the rest of the estimators,
including the truncated kernel plug-in estimator, have similar, slow rates of convergence. It
is therefore clear that the proposed optimal ensemble averaging significantly accelerates the
MSE convergence rate.

4.1.2 Variation of MSE with dimension d 

For fixed sample size $T$ and fixed number of estimators $L$, it can be seen that $\epsilon$ increases
monotonically with $d$. This follows from the fact that the number of constraints in the convex
problem (4.1) is equal to $d + 1$ and each of the basis functions $\psi_i(l) = l^{i/d}$ monotonically
approaches 1 as $d$ grows. This in turn implies that for a fixed sample size $T$ and number of
estimators $L$, the overall MSE of the ensemble estimator should increase monotonically with
the dimension $d$.



The MSE results of the different estimators are shown in Fig. 1(b) as a function of 
dimension d, for fixed sample size T = 3000. For the standard kernel plug-in estimator 
and truncated kernel plug-in estimator, the MSE increases rapidly with d as expected. The 
MSEs of the histogram and $k$-NN estimators increase at a similar rate, indicating that these
estimators suffer from the curse of dimensionality as well. On the other hand, the MSE of 
the weighted estimator also increases with the dimension as predicted, but at a slower rate. 
Also observe that the MSE of the weighted estimator is smaller than the MSE of the other 
estimators for all dimensions d > 3. 

4.2 Distribution testing 

In this section, we illustrate the weighted ensemble estimator for non-parametric estimation
of Shannon differential entropy. The Shannon differential entropy is given by $G(f)$ where
$g(f,x) = -\log(f)\, I(f > 0) + I(f = 0)$. The improved accuracy of the weighted ensemble
estimator is demonstrated in the context of hypothesis testing using estimated entropy as a
statistic to test for the underlying probability distribution of a random sample. Specifically,
the samples under the null and alternate hypotheses $H_0$ and $H_1$ are drawn from the probability
distribution $f(a,b,p,d)$, described in Section 4.1, with fixed $d = 6$, $p = 0.75$ and
two sets of values of $a$, $b$ under the null and alternate hypothesis, $H_0 : a = a_0, b = b_0$ versus
$H_1 : a = a_1, b = b_1$.

(a) Entropy estimates for random samples corresponding to
hypothesis $H_0$ (experiments 1-500) and $H_1$ (experiments 501-1000).

(b) Histogram envelopes of entropy estimates for random samples
corresponding to hypothesis $H_0$ (blue) and $H_1$ (red).

Figure 2: Entropy estimates using standard kernel plug-in estimator, truncated kernel plug-in
estimator and the weighted estimator, for random samples corresponding to hypothesis
$H_0$ and $H_1$. The weighted estimator provides better discrimination ability by suppressing
the bias, at the cost of some additional variance.

First, we fix $a_0 = b_0 = 6$ and $a_1 = b_1 = 5$. The density under the null hypothesis
$f(6, 6, 0.75, 6)$ has greater curvature relative to $f(5, 5, 0.75, 6)$ and therefore has smaller
entropy. Five hundred (500) experiments are performed under each hypothesis, with each
experiment consisting of 1000 samples drawn from the corresponding distribution. The true
entropy and the estimates $\hat{\mathbf{G}}_k$, $\tilde{\mathbf{G}}_k$ and $\tilde{\mathbf{G}}_w$ obtained from each instance of $10^3$ samples are shown
in Fig. 2(a) for the 1000 experiments. This figure suggests that the ensemble weighted estimator
provides better discrimination ability by suppressing the bias, at the cost of some
additional variance.

To demonstrate that the weighted estimator provides better discrimination, we plot the
histogram envelope of the entropy estimates using the standard kernel plug-in estimator,
truncated kernel plug-in estimator and the weighted estimator for the cases corresponding to
the hypothesis $H_0$ (color coded blue) and $H_1$ (color coded red) in Fig. 2(b). Furthermore,
we quantitatively measure the discriminative ability of the different estimators using the
deflection statistic $d_s = |\mu_1 - \mu_0| / \sqrt{\sigma_0^2 + \sigma_1^2}$, where $\mu_0$ and $\sigma_0$ (respectively $\mu_1$ and $\sigma_1$) are
the sample mean and standard deviation of the entropy estimates. The deflection statistic
was found to be 1.49, 1.60 and 1.89 for the standard kernel plug-in estimator, truncated
kernel plug-in estimator and the weighted estimator respectively. The receiver operating
characteristic (ROC) curves for this entropy-based test using the three different estimators are shown in
Fig. 3(a). The corresponding areas under the ROC curves (AUC) are given by 0.9271, 0.9459
and 0.9619.
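Both summary measures are straightforward to compute from two arrays of test statistics. The sketch below is illustrative only: the function names are hypothetical, the use of the unbiased sample standard deviation (`ddof=1`) is an assumption, and the Gaussian draws merely stand in for entropy estimates under $H_0$ and $H_1$. The AUC is computed as the probability that a statistic under $H_1$ exceeds one under $H_0$ (ties counted as one half), which matches a test that rejects $H_0$ for large estimated entropy.

```python
import numpy as np

def deflection(stat0, stat1):
    """Deflection statistic |mu1 - mu0| / sqrt(sigma0^2 + sigma1^2)."""
    mu0, mu1 = np.mean(stat0), np.mean(stat1)
    s0 = np.std(stat0, ddof=1)                  # assumed: unbiased sample std
    s1 = np.std(stat1, ddof=1)
    return float(abs(mu1 - mu0) / np.sqrt(s0**2 + s1**2))

def auc(stat0, stat1):
    """Area under the ROC curve of the test 'reject H0 when the statistic is large'."""
    s0 = np.asarray(stat0, dtype=float)[:, None]
    s1 = np.asarray(stat1, dtype=float)[None, :]
    # fraction of (H0, H1) pairs ranked correctly, ties counted as one half
    return float(np.mean((s1 > s0) + 0.5 * (s1 == s0)))

# Synthetic stand-ins for entropy estimates under the two hypotheses.
rng = np.random.default_rng(0)
h0 = rng.normal(0.0, 1.0, size=500)
h1 = rng.normal(2.0, 1.0, size=500)
ds = deflection(h0, h1)
area = auc(h0, h1)
print(ds, area)
```

For two unit-variance Gaussians separated by 2, the population deflection is $2/\sqrt{2} \approx 1.41$, so the sample value should land near that figure.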

In our final experiment, we fix $a_0 = b_0 = 10$ and set $a_1 = b_1 = 10 - \delta$, perform 500
experiments each under the null and alternate hypotheses with samples of size 5000, and
plot the AUC as $\delta$ varies from 0 to 1 in Fig. 3(b). For comparison, we also plot the AUC for
the Neyman-Pearson likelihood ratio test. The Neyman-Pearson likelihood ratio test, unlike
the Shannon entropy based tests, is an omniscient test that assumes knowledge of both the
underlying beta-uniform mixture parametric model of the density and the parameter values
$a_0$, $b_0$ and $a_1$, $b_1$ under the null and alternate hypothesis respectively. Figure 3(b) shows that the
weighted estimator uniformly and significantly outperforms the individual plug-in estimators
and comes closest to the performance of the omniscient Neyman-Pearson likelihood test.
The relatively superior performance of the Neyman-Pearson likelihood test is due to the
fact that the weighted estimator is a nonparametric estimator that has marginally higher
variance (proportional to $\|w^*\|_2^2$) as compared to the underlying parametric model for which
the Neyman-Pearson test statistic provides the most powerful test.

(a) ROC curves (horizontal axis: false positive rate) corresponding to entropy estimates obtained
using standard and truncated kernel plug-in estimators and
the weighted estimator. The corresponding AUC are given by
0.9271, 0.9459 and 0.9619.

(b) Variation of AUC vs $\delta$ ($= a_0 - a_1 = b_0 - b_1$) corresponding
to the Neyman-Pearson omniscient test, and to entropy estimates using
the standard and truncated kernel plug-in estimators and
the weighted estimator.

Figure 3: Comparison of performance in terms of ROC for the distribution testing problem.
The weighted estimator uniformly outperforms the individual plug-in estimators.

5 Conclusions

We have proposed a new estimator of functionals of a multivariate density based on weighted
ensembles of kernel density estimators. For ensembles of estimators that satisfy general
conditions on bias and variance as specified by $\mathscr{C}$.1 (2.1) and $\mathscr{C}$.2 (2.2) respectively, the weight
optimized ensemble estimator has parametric $O(T^{-1})$ MSE convergence rate that can be
much faster than the rate of convergence of any of the individual estimators in the ensemble.
The optimal weights are determined as a solution to a convex optimization problem that
can be performed offline and does not require training data. We illustrated this estimator
for uniform kernel plug-in estimators and demonstrated the superior performance of the
weighted ensemble entropy estimator for (i) estimation of the Panter-Dite factor and (ii)
non-parametric hypothesis testing.

Several extensions of the framework of this paper are being pursued: (i) using $k$-nearest
neighbor ($k$-NN) estimators in place of kernel estimators; (ii) extending the framework to the
case where the support $\mathcal{S}$ is not known, but for which conditions $\mathscr{C}$.1 and $\mathscr{C}$.2 hold; (iii) using
ensemble estimators for estimation of other functionals of probability densities including
divergence, mutual information and intrinsic dimension; and (iv) using an $\ell_1$ norm $\|w\|_1$ in
place of the $\ell_2$ norm $\|w\|_2$ in the weight optimization algorithm (2.3) so as to introduce
sparsity into the weighted ensemble.
Appendices

Outline of appendix

We first establish moment properties for uniform kernel density estimates in Appendix A.
Subsequently, we prove Theorems 2 and 3 in Appendix B.

A Moment properties of boundary compensated uniform kernel density estimates

Throughout this section, we assume without loss of generality that the support $S = [-1, 1]^d$.
Observe that $\mathbf{l}_k(X)$ is a binomial random variable with parameters $M$ and $U_k(X) = \Pr(Z \in S_k(X))$.
The probability mass function of the binomial random variable $\mathbf{l}_k(X)$ is given by

\[ \Pr(\mathbf{l}_k(X) = l) = \binom{M}{l} (U_k(X))^{l} (1 - U_k(X))^{M-l}. \tag{A.1} \]
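The binomial law in (A.1) can be checked empirically. The following is a minimal sketch, not part of the original analysis: it assumes a one-dimensional uniform density on $[-1, 1]$, an interior point, and illustrative values of `M` and `width`. The helper names `kernel_count` and `simulate_counts` are hypothetical.

```python
import random

def kernel_count(samples, x, width):
    """Number of samples falling in the interval [x - width/2, x + width/2]."""
    half = width / 2.0
    return sum(1 for s in samples if abs(s - x) <= half)

def simulate_counts(M, x, width, trials, seed=0):
    """Draw `trials` independent data sets of size M and record the kernel count."""
    rng = random.Random(seed)
    counts = []
    for _ in range(trials):
        samples = [rng.uniform(-1.0, 1.0) for _ in range(M)]
        counts.append(kernel_count(samples, x, width))
    return counts

# Assumed illustrative setting: d = 1, uniform density 1/2 on [-1, 1].
M, width, x = 200, 0.2, 0.0
U = width / 2.0  # coverage: density 1/2 times interval length
counts = simulate_counts(M, x, width, trials=2000)
mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)
print(mean, M * U)           # empirical mean vs binomial mean M * U
print(var, M * U * (1 - U))  # empirical variance vs binomial variance
```

The empirical mean and variance of the count should be close to $M U_k(X)$ and $M U_k(X)(1 - U_k(X))$, the first two moments of the distribution in (A.1).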



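Define the error function of the truncated uniform kernel density,

\[ \mathbf{e}_k(X) = \tilde{\mathbf{f}}_k(X) - E[\tilde{\mathbf{f}}_k(X)]
 = \frac{\mathbf{l}_k(X)}{M V_k(X)} - \frac{U_k(X)}{V_k(X)}
 = \frac{\sum_{i=1}^{M} \left(1_{\{X_i \in S_k(X)\}} - U_k(X)\right)}{M V_k(X)}. \]

Also define the error function of the standard uniform kernel density,

\[ \hat{\mathbf{e}}_k(X) = \hat{\mathbf{f}}_k(X) - E[\hat{\mathbf{f}}_k(X)] = (M V_k(X)/k)\, \mathbf{e}_k(X), \]

and note that when $X \in S_I(k)$, $\hat{\mathbf{e}}_k(X) = \mathbf{e}_k(X)$.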

A.1 Taylor series expansion of coverage

For any $X \in S$, the coverage function $U_k(X)$ can be represented by using a $d$-th order Taylor
series expansion of $f$ about $X$ as follows. Because the density $f$ has continuous partial
derivatives of order $d$ in $S$, for any $X \in S$,

\[ U_k(X) = \int_{S_k(X)} f(z)\,dz
 = f(X) V_k(X) + \sum_{i=1}^{d} c_{i,k}(X)\, V_k^{1+i/d}(X) + o\big((k/M)^2\big), \tag{A.2} \]

where $c_{i,k}$ are functions which depend on $k$ and the unknown density $f$. This implies that
the expectation of the density estimate is given by

\[ E[\tilde{\mathbf{f}}_k(X)] = U_k(X)/V_k(X)
 = f(X) + \sum_{i=1}^{d} c_{i,k}(X) \left(\frac{k}{M}\right)^{i/d} + o\left(\frac{k}{M}\right). \tag{A.3} \]
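As a worked illustration of (A.3), the sketch below evaluates $U_k(X)/V_k(X)$ exactly for an assumed one-dimensional density $f(x) = \tfrac{3}{4}(1 - x^2)$ on $[-1, 1]$ at an interior point. The density, the point `x = 0.3` and the widths `h` (playing the role of $k/M$ when $d = 1$) are assumptions chosen for the example. For this density the bias equals $-h^2/16$ exactly, so it vanishes faster than $k/M$, in line with the $o(k/M)$ remainder.

```python
def f(x):
    """Assumed example density on [-1, 1]: f(x) = 0.75 * (1 - x^2)."""
    return 0.75 * (1.0 - x * x)

def F(x):
    """Antiderivative of f, used to integrate f exactly over an interval."""
    return 0.75 * (x - x ** 3 / 3.0)

def expected_estimate(x, h):
    """U_k(x) / V_k(x) for the interval of width h centred at an interior x."""
    coverage = F(x + h / 2.0) - F(x - h / 2.0)  # U_k(x)
    return coverage / h                         # V_k(x) = h in the interior

x = 0.3
for h in (0.4, 0.2, 0.1):
    bias = expected_estimate(x, h) - f(x)
    # For this density the bias is f''(x) * h^2 / 24 = -h^2 / 16 exactly.
    print(h, bias, -h * h / 16.0)
```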

A.2 Concentration inequalities for uniform kernel density estimator

Because $\mathbf{l}_k(X)$ is a binomial random variable, standard Chernoff inequalities can be applied
to obtain concentration bounds on $\mathbf{l}_k(X)$. In particular, for $0 < p < 1/2$,

\[ \Pr(\mathbf{l}_k(X) > (1 + p) M U_k(X)) \le e^{-M U_k(X) p^2/4}, \]

\[ \Pr(\mathbf{l}_k(X) < (1 - p) M U_k(X)) \le e^{-M U_k(X) p^2/4}. \tag{A.4} \]
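The bounds in (A.4) can be compared against exact binomial tail probabilities. This is a minimal sketch with assumed illustrative values of `M`, `U` and `p`; the function names are hypothetical.

```python
import math

def binom_pmf(M, U, l):
    """Binomial(M, U) probability mass at l, as in (A.1)."""
    return math.comb(M, l) * U ** l * (1.0 - U) ** (M - l)

def upper_tail(M, U, p):
    """Exact Pr(l > (1 + p) * M * U) for l ~ Binomial(M, U)."""
    thr = (1.0 + p) * M * U
    return sum(binom_pmf(M, U, l) for l in range(M + 1) if l > thr)

def lower_tail(M, U, p):
    """Exact Pr(l < (1 - p) * M * U) for l ~ Binomial(M, U)."""
    thr = (1.0 - p) * M * U
    return sum(binom_pmf(M, U, l) for l in range(M + 1) if l < thr)

def chernoff_bound(M, U, p):
    """Right-hand side of (A.4)."""
    return math.exp(-M * U * p * p / 4.0)

M, U = 500, 0.05  # assumed illustrative values
for p in (0.1, 0.25, 0.4):
    print(p, upper_tail(M, U, p), lower_tail(M, U, p), chernoff_bound(M, U, p))
```

Both exact tails should fall below the bound for every admissible `p`; the bound is loose for small `p` and tightens as `M * U * p**2` grows.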
Let $\natural(X)$ denote the event $(1 - p_k) M U_k(X) < \mathbf{l}_k(X) < (1 + p_k) M U_k(X)$, where $p_k = 1/k^{\delta/2}$
for some fixed $\delta \in (2/3, 1)$. Then, for $k = O(M^{\beta})$,

\[ \Pr(\natural^c(X)) = O\big(e^{-p_k^2 k}\big) = \mathcal{C}(M), \tag{A.5} \]

where $\mathcal{C}(M)$ satisfies the condition $\lim_{M \to \infty} M^{a}\, \mathcal{C}(M) = 0$ for any $a > 0$. Also observe that
under the event $\natural(X)$,

\[ |\mathbf{e}_k(X)| = \left| \frac{\mathbf{l}_k(X)}{M V_k(X)} - \frac{U_k(X)}{V_k(X)} \right|
 = O(p_k U_k(X)/V_k(X)) = O(p_k) = O(1/k^{\delta/2}). \tag{A.6} \]

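A.3 Bounds on uniform kernel density estimator

Let $B_r(X)$ be a Euclidean ball of radius $r$ centered at $X$. Let $X$ be a Lebesgue point of $f$,
i.e., an $X$ for which

\[ \lim_{r \to 0} \frac{\int_{B_r(X)} f(y)\,dy}{\int_{B_r(X)} dy} = f(X). \]

Because $f$ is a density, we know that almost all $X \in S$ satisfy the above property. Now, fix
$\epsilon \in (0, 1)$ and find $r_\epsilon > 0$ such that

\[ \sup_{0 < r \le r_\epsilon} \left| \frac{\int_{B_r(X)} f(y)\,dy}{\int_{B_r(X)} dy} - f(X) \right| \le (\epsilon/2) f(X). \]

For small values of $k/M$, $B_{r_\epsilon}(X) \subset S_k(X)$ and therefore

\[ (1 - \epsilon/2) f(X) V_k(X) \le U_k(X) \le (1 + \epsilon/2) f(X) V_k(X). \tag{A.7} \]

This implies that under the event $\natural(X)$ defined in the previous subsection,

\[ (1 - \epsilon)\epsilon_0 < \tilde{\mathbf{f}}_k(X) < (1 + \epsilon)\epsilon_\infty. \tag{A.8} \]

Let $\natural_0(X)$ denote the event that $\tilde{\mathbf{f}}_k(X) = 0$. Let $\natural_1(X)$ denote the event $1 \le \mathbf{l}_k(X) \le (1 - p_k) M U_k(X)$
and $\natural_2(X)$ denote $\mathbf{l}_k(X) \ge (1 + p_k) M U_k(X)$. Then conditioned on the
event $\natural_1(X)$,

\[ 1/k < \tilde{\mathbf{f}}_k(X) < (1 + \epsilon)\epsilon_\infty, \tag{A.9} \]

and conditioned on the event $\natural_2(X)$,

\[ (1 - \epsilon)\epsilon_0 < \tilde{\mathbf{f}}_k(X) < 2^d M/k. \tag{A.10} \]

Observe that $\natural_0(X)$, $\natural_1(X)$, $\natural_2(X)$ and $\natural(X)$ form a disjoint partition of the event space.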
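The four events of this subsection split the range $\{0, \ldots, M\}$ of the kernel count into disjoint pieces. The sketch below computes their exact binomial probabilities for assumed illustrative values ($M = 1000$, $k = 100$, $\delta = 0.8$, and a coverage $U = 0.05$ corresponding to a locally constant density of $1/2$ with $V_k = k/M$). The dictionary keys are hypothetical labels for the zero-count, lower-deviation, typical and upper-deviation events.

```python
import math

def binom_pmf(M, U, l):
    """Binomial(M, U) probability mass at l."""
    return math.comb(M, l) * U ** l * (1.0 - U) ** (M - l)

def event_probabilities(M, U, p):
    """Exact probabilities of the four events partitioning {0, ..., M}."""
    lo, hi = (1.0 - p) * M * U, (1.0 + p) * M * U
    probs = {"zero": 0.0, "lower": 0.0, "typical": 0.0, "upper": 0.0}
    for l in range(M + 1):
        w = binom_pmf(M, U, l)
        if l == 0:
            probs["zero"] += w     # count is zero
        elif l <= lo:
            probs["lower"] += w    # 1 <= l <= (1 - p) M U
        elif l < hi:
            probs["typical"] += w  # (1 - p) M U < l < (1 + p) M U
        else:
            probs["upper"] += w    # l >= (1 + p) M U
    return probs

M, k, delta = 1000, 100, 0.8    # assumed illustrative values
U = 0.5 * k / M                 # coverage for density 1/2 and V_k = k / M
p_k = k ** (-delta / 2.0)
probs = event_probabilities(M, U, p_k)
print(probs, sum(probs.values()))
```

The probabilities sum to one by construction. At these moderate values of `k` the typical event is the most likely but not yet overwhelmingly so; the rate in (A.5) is an asymptotic statement.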






A.4 Bias 

Lemma 4. Let $\gamma(x, y)$ be an arbitrary function with $d$ partial derivatives wrt $x$ and $\sup_{x,y} |\gamma(x, y)| < \infty$.
Let $X_1, \ldots, X_M, X$ denote $M + 1$ i.i.d realizations of the density $f$. Then,

\[ E[\gamma(\tilde{f}_k(Z), Z)] - E[\gamma(f(Z), Z)] = \sum_{i=1}^{d} c_{1,i}(\gamma(x, y)) (k/M)^{i/d} + o(k/M), \tag{A.11} \]

where $c_{1,i}(\gamma(x, y))$ are functionals of $\gamma$ and $f$.

Proof. To analyze the bias, first extend the density function $f$ as follows. In particular,
extend the definition of $f$ to the domain $S_E = [-2, 2]^d$ while ensuring that the extended
function $f_e$ is differentiable $d$ times on this extended domain. Let $\hat{S}_k(X) = \{Y : \|X - Y\|_1 \le d_k/2\}$
be the natural un-truncated ball. Let $\hat{U}_k(X) = \int_{\hat{S}_k(X)} f_e(z)\,dz$. Define the function
$\hat{f}_k(X) = \hat{U}_k(X)/(k/M)$. For any $X \in S$, using this extended definition,

\[ \hat{U}_k(X) = \int_{\hat{S}_k(X)} f_e(z)\,dz
 = f(X)(k/M) + \sum_{i=1}^{d} c_i(X) (k/M)^{1+i/d} + o\big((k/M)^2\big), \tag{A.12} \]

where $c_i$ are only functions of the unknown density $f_e$. Also define $\tilde{f}_k(X) = E[\tilde{\mathbf{f}}_k(X) \mid X]$.
Define the interior region $S_I(k) = \{X \in S : \hat{S}_k(X) \cap S^c = \emptyset\}$. Note that $\tilde{f}_k(X) = \hat{f}_k(X)$ for
all $X \in S_I(k)$. Now,

\[ E[\gamma(\tilde{f}_k(Z), Z)] - E[\gamma(f(Z), Z)]
 = E[\gamma(\hat{f}_k(Z), Z) - \gamma(f(Z), Z)] + E[\gamma(\tilde{f}_k(Z), Z) - \gamma(\hat{f}_k(Z), Z)] \]
\[ = E[\gamma(\hat{f}_k(Z), Z) - \gamma(f(Z), Z)] + E[1_{\{Z \in S - S_I(k)\}} (\gamma(\tilde{f}_k(Z), Z) - \gamma(\hat{f}_k(Z), Z))]
 = I + II. \tag{A.13} \]

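A.4.1 Evaluation of I

\[ E[\gamma(\hat{f}_k(Z), Z) - \gamma(f(Z), Z)]
 = \sum_{j=1}^{d} \frac{1}{j!} E\big[\gamma^{(j)}(f(Z), Z) (\hat{f}_k(Z) - f(Z))^j\big] + o(k/M)
 = \sum_{i=1}^{d} c_{11,i}(\gamma(x, y)) (k/M)^{i/d} + o(k/M), \tag{A.14} \]

where $c_{11,i}(\gamma(x, y))$ are functionals of $\gamma(x, y)$ and its derivatives.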



A.4.2 Evaluation of II

Let m = M/2, kM = k/M and km = {k/mY^'^. Define mappings %, 3^r and 3^s- S — S/(fc) 
— )■ !B as follows. Let u{X) denote the unit vector from the origin to X, and define 3'f,(X) = 
u{X) n !B. Let S/(m) be a reference set. Define 3^r{^) = "^l-^) H S/(wi). Let 4(X) = 
||g^b(X) - X||. Finally define ?',(X) = n{X)u{X), where n(X) satisfies WM^) - S'siX)]] = 
(m/kfl%{X). For each X G S - S/(A;), let /,(X) = IIS^^lX) - 'Js{X)\\ and /^a:.(X) = 
||5'6(X) — J'rlX)!!. Let U denote the set of all unit vectors: 11 = U|xes-S7(fc)}w(X). Ob- 
serve that, by definition, the shape of the regions Sk{X) and S'm(5's(X)) is identical. This is 
illustrated in Fig. |4j 

Analysis of /m(?'s(X)), /m(5's(X)) ^'^(X) can represented in terms of "JsiX) as ^'^(X) = 
3^s{.X) + ls{X)u{X). Using Taylor series around 3^5(X), /m(3^s(-^)) can then be evaluated as 

4(:j,(x)) = Urn{'Js{x))/Vm{^s{x)) 

d 

= /(3^,(X)) + 5^c,^,(x)(3^.(X))C(X) + oa:^(X)), (A.15) 

1=1 

where the functionals Cj^gr^(x) depend only on the shape of the regions Sk{X) or S'm(3's(X)) 
and therefore only on ^'^(X). Similarly, 

fm{X) = Um{X)/{l/2) 

d 

= /(J,(X)) + ^Q,5.(x)(3^6(X))C(X) + oa:^(X)), (A.16) 

j=i 

where the functionals Cj,3-j,(x) again depend only on ^'^(X). This implies that for any fixed 
M e U and corresponding Xb G "B, for any function r]{x) and positive integer q G {l,..,d}, 
integration over the line /(X(,) = {X{, — cu{Xh)] c G (0, lmax{Xh))} 

v{z){U{z)-f{z)rdz 

zeiiXh) 
d 

= 5^Q,,,,(X,)C,,(X) + o(C,(X)), (A.17) 

i=q 

and 

r/(Z)(/„(Z)-/(Z)rrfZ 



/. 



d 



i=q 



where the functions Cj,g,^(X;,) and Cj,g,^(X;,) depend only on Xf,, q, rj and are independent of 
Z and fc. 
Analysis of fk{X), fk{X) 9^fe(X) can be represented in terms of X as S^biX) = X + 
kmlr{X)u{X). Identically, this gives, 

Mx) = Uk{x)/Vk{x) 

d 

= fiMX)) + Y,h,Mx){UX))kUl{X) + o{kUt{X)). (A.19) 

i=l 

and 

UX) = Uu{X)/kM 

d 

= f{UX)) + Y.6,,,,ix){UXWMUX) + o{kUt{X)). (A.20) 

j=i 

This implies that for any fixed u ElL and corresponding Xf, G S, integration over the line 
l{Xb) = {Xb - cu{Xb); c e (0, k^UaxiXb))} 

r^{Z){MZ) - f{Z)ydZ 

ZGl{Xb) 
d 

J2krAXb)kyi,,{x) + o{kiC{x)), (A.21) 

i=q 



and 



7^{z){Mz) - f{z)ydz 

zei{Xt) 
d 

Y,c.,r,,{Xb)kyi,,{X) + o{kiC{X)). (A.22) 

i=q 



Analysis of II 

// = E[lzes^s,ikMfkiZ), Z) - 7(A(Z), Z))] 

i^iUZ),Z)-^{MZ),Z))fiZ)dZ 

Z&-%i(k) 

d 

l{Z=Xi,-c«(Xb)} 

{Xf,eS}U{cG(0,fc„Ua:r{^f,))} j = l 

d 



J2 [i^HfiXb),Xb){h{z)-f{z)y] f{z)dz 

« d 

\ i{z=x,-e«(x,)} Y. W'^{AXb).Xb)[m) - f{z)y\ f{z)dz 

J {Xie'3}u{ce{o,k,nUa^{Xt))} -^^ ^ ^ 

d 
Yl ci2.(7(a:, ym/My/' + o((A;/M)), (A.23) 



1=1 
where $c_{12,i}(\gamma(x, y))$ are functionals of $\gamma(x, y)$ and its derivatives. This implies that

\[ E[\gamma(\tilde{f}_k(Z), Z)] - E[\gamma(f(Z), Z)] = I + II
 = \sum_{i=1}^{d} c_{1,i}(\gamma(x, y)) (k/M)^{i/d} + o(k/M), \tag{A.24} \]

where the functionals $c_{1,i}(\gamma(x, y))$ are independent of $k$. □

A.5 Central Moments

Since \k{X) is a binomial random variable, we can easily obtain moments of the uniform 
kernel density estimate in terms of Uk{X). These are listed below. 

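Lemma 5. Let $\gamma(x)$ be an arbitrary function satisfying $\sup_x |\gamma(x)| < \infty$. Let $X_1, \ldots, X_M, X$
denote $M + 1$ i.i.d realizations of the density $f$. Then,

\[ E[\gamma(X) \mathbf{e}_k^q(X)] = 1_{\{q=2\}}\, c_2(\gamma(x)) \left(\frac{1}{k}\right) + o\left(\frac{1}{k}\right), \tag{A.25} \]

\[ E[\gamma(X) \hat{\mathbf{e}}_k^q(X)] = 1_{\{q=2\}}\, c_2(\gamma(x)) \left(\frac{1}{k}\right) + o\left(\frac{1}{k}\right), \tag{A.26} \]

where $c_2(\gamma(x))$ is a functional of $\gamma$ and $f$.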
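Proof. When $q = 2$,

\[ V[\tilde{\mathbf{f}}_k(X)] = E[\mathbf{e}_k^2(X)]
 = \frac{U_k(X)(1 - U_k(X))}{M V_k^2(X)}
 = \frac{f(X)}{M V_k(X)} + o\left(\frac{1}{k}\right). \tag{A.27} \]

For any integer $q \ge 3$,

\[ E[\mathbf{e}_k^q(X)] = E[1_{\natural(X)} \mathbf{e}_k^q(X)] + E[1_{\natural^c(X)} \mathbf{e}_k^q(X)]
 = O\left(\frac{1}{k^{q\delta/2}}\right) = o(1/k). \tag{A.28} \]

Observe that $V_k(X) = \Theta(k/M)$ and therefore $E[\mathbf{e}_k^2(X)] = O(1/k) + o(1/k)$. This implies

\[ E[\gamma(X) \mathbf{e}_k^q(X)] = 1_{\{q=2\}}\, c_2(\gamma(x)) \left(\frac{1}{k}\right) + o\left(\frac{1}{k}\right). \]

When $X \in S_I(k)$, $\hat{\mathbf{e}}_k(X) = \mathbf{e}_k(X)$. Also $\Pr(X \notin S_I(k)) = o(1)$. This result in conjunction
with the fact that $\hat{\mathbf{e}}_k(X) = (M V_k(X)/k)\, \mathbf{e}_k(X)$, and $V_k(X) = \Theta(k/M)$ gives

\[ E[\gamma(X) \hat{\mathbf{e}}_k^q(X)] = 1_{\{q=2\}}\, c_2(\gamma(x)) \left(\frac{1}{k}\right) + o\left(\frac{1}{k}\right). \]

□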
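The $1/k$ scaling of the second moment can be illustrated numerically. The sketch below evaluates the exact binomial variance $U_k(1 - U_k)/(M V_k^2)$ at an interior point under the simplifying assumption that the density is locally constant, so that $U_k = f(X) V_k$ with $V_k = k/M$. The density value `f_x` and the `(M, k)` pairs are assumed for illustration, and `error_variance` is a hypothetical helper name.

```python
def error_variance(M, k, f_x):
    """Exact E[e_k^2(X)] = U (1 - U) / (M V^2) at an interior point.

    Assumes V = k / M (kernel volume) and U = f_x * V (coverage of a
    locally constant density with value f_x).
    """
    V = k / M
    U = f_x * V
    return U * (1.0 - U) / (M * V * V)

f_x = 0.5  # assumed density value at the point of interest
for M, k in ((10**3, 30), (10**4, 100), (10**5, 300), (10**6, 1000)):
    # k * E[e_k^2] equals f_x * (1 - f_x * k / M), which tends to f_x as k/M -> 0.
    print(M, k, k * error_variance(M, k, f_x))
```

Under these assumptions the rescaled variance differs from `f_x` by exactly `f_x**2 * k / M`, which is the sense in which the leading term is of order $1/k$ with an $o(1/k)$ correction.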
A.6 Cross moments

Let $X$ and $Y$ be two distinct points. Clearly the density estimates at $X$ and $Y$ are not
independent. Observe that the uniform kernel regions $S_k(X)$, $S_k(Y)$ are disjoint for the set
of points given by $\Psi_k := \{X, Y\} : \|X - Y\|_1 \ge 2(k/M)^{1/d}$, and have finite intersection on
the complement of $\Psi_k$.

Intersecting balls 

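Lemma 6. For a fixed pair of points $\{X, Y\} \in \Psi_k$, and positive integers $q$, $r$,

\[ \mathrm{Cov}[\mathbf{e}_k^q(X), \mathbf{e}_k^r(Y)] = 1_{\{q=1, r=1\}} \left(\frac{-f(X) f(Y)}{M}\right) + o\left(\frac{1}{M}\right). \]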

Proof. For a fixed pair of points $\{X, Y\} \in \Psi_k$, the joint probability mass function of the
functions $\mathbf{l}_k(X)$, $\mathbf{l}_k(Y)$ is given by

\[ \Pr(\mathbf{l}_k(X) = l_x, \mathbf{l}_k(Y) = l_y) = 1_{\{l_x + l_y \le M\}} \binom{M}{l_x,\, l_y} (U_k(X))^{l_x} (U_k(Y))^{l_y} (1 - U_k(X) - U_k(Y))^{M - l_x - l_y}. \]

Denote the high probability event $\natural(X) \cap \natural(Y)$ by $\natural(X, Y)$. Define $\hat{\mathbf{l}}_k(X)$, $\hat{\mathbf{l}}_k(Y)$ to be binomial
random variables with parameters $\{U_k(X), M - q\}$ and $\{U_k(Y), M - r\}$ respectively. The
covariance between powers of density estimates is then given by

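\[ \mathrm{Cov}(\tilde{\mathbf{f}}_k^q(X), \tilde{\mathbf{f}}_k^r(Y))
 = \frac{1}{M^{q+r} V_k^q(X) V_k^r(Y)}\, \mathrm{Cov}(\mathbf{l}_k^q(X), \mathbf{l}_k^r(Y)) \]
\[ = \frac{1}{M^{q+r} V_k^q(X) V_k^r(Y)} \sum l_x^q l_y^r \big[\Pr(\mathbf{l}_k(X) = l_x, \mathbf{l}_k(Y) = l_y) - \Pr(\mathbf{l}_k(X) = l_x) \Pr(\mathbf{l}_k(Y) = l_y)\big] \]
\[ = \frac{1}{M^{q+r} V_k^q(X) V_k^r(Y)} \sum_{\natural(X,Y)} l_x^q l_y^r \big[\Pr(\mathbf{l}_k(X) = l_x, \mathbf{l}_k(Y) = l_y) - \Pr(\mathbf{l}_k(X) = l_x) \Pr(\mathbf{l}_k(Y) = l_y)\big] + o\left(\frac{1}{M}\right) \]
\[ = \frac{1}{M^{q+r} V_k^q(X) V_k^r(Y)} \sum_{\natural(X,Y)} (l_x \times \cdots \times (l_x - q + 1)) (l_y \times \cdots \times (l_y - r + 1)) \big[\Pr(\mathbf{l}_k(X) = l_x, \mathbf{l}_k(Y) = l_y) - \Pr(\mathbf{l}_k(X) = l_x) \Pr(\mathbf{l}_k(Y) = l_y)\big] + o\left(\frac{1}{M}\right) \]
\[ = \frac{U_k^q(X) U_k^r(Y)}{M^{q+r} V_k^q(X) V_k^r(Y)} \sum_{\natural(X,Y)} \big[(M \times \cdots \times (M - (q + r - 1))) \Pr(\hat{\mathbf{l}}_k(X) = l_x, \hat{\mathbf{l}}_k(Y) = l_y) \]
\[ \qquad - (M \times \cdots \times (M - (q - 1))) (M \times \cdots \times (M - (r - 1))) \Pr(\hat{\mathbf{l}}_k(X) = l_x) \Pr(\hat{\mathbf{l}}_k(Y) = l_y)\big] + o\left(\frac{1}{M}\right) \]
\[ = \frac{f^q(X) f^r(Y)}{M^{q+r}} \big[(M \times \cdots \times (M - (q + r - 1))) - (M \times \cdots \times (M - (q - 1))) (M \times \cdots \times (M - (r - 1)))\big] + o\left(\frac{1}{M}\right) \]
\[ = -qr\, f^q(X) f^r(Y) \left(\frac{1}{M}\right) + o\left(\frac{1}{M}\right). \]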

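Then, the covariance between the powers of the error function is given by

\[ \mathrm{Cov}(\mathbf{e}_k^q(X), \mathbf{e}_k^r(Y)) = \mathrm{Cov}\big((\tilde{\mathbf{f}}_k(X) - E[\tilde{\mathbf{f}}_k(X)])^q, (\tilde{\mathbf{f}}_k(Y) - E[\tilde{\mathbf{f}}_k(Y)])^r\big) \]
\[ = \sum_{a=1}^{q} \sum_{b=1}^{r} \binom{q}{a} \binom{r}{b} (-E[\tilde{\mathbf{f}}_k(X)])^{q-a} (-E[\tilde{\mathbf{f}}_k(Y)])^{r-b}\, \mathrm{Cov}(\tilde{\mathbf{f}}_k^a(X), \tilde{\mathbf{f}}_k^b(Y)) \]
\[ = \sum_{a=1}^{q} \sum_{b=1}^{r} \binom{q}{a} \binom{r}{b} \big[(-f(X))^{q-a} (-f(Y))^{r-b} + o(1)\big]\, \mathrm{Cov}(\tilde{\mathbf{f}}_k^a(X), \tilde{\mathbf{f}}_k^b(Y)) \]
\[ = 1_{\{q=1, r=1\}} \left(\frac{-f(X) f(Y)}{M}\right) + o\left(\frac{1}{M}\right). \]

□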
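Two combinatorial facts drive the proof above: the bracketed difference of falling factorials, divided by $M^{q+r}$, behaves like $-qr/M$, and the alternating binomial sum $\sum_{a=1}^{q} \binom{q}{a} a (-1)^{q-a}$ vanishes unless $q = 1$, which is why only the $q = r = 1$ term survives. The following sketch checks both with exact arithmetic; the function names are hypothetical and the values of `M`, `q`, `r` are illustrative.

```python
from fractions import Fraction
from math import comb

def falling(M, n):
    """Falling factorial M (M - 1) ... (M - n + 1)."""
    out = 1
    for j in range(n):
        out *= M - j
    return out

def bracket(M, q, r):
    """[falling(M, q + r) - falling(M, q) * falling(M, r)] / M^(q + r), exactly."""
    num = falling(M, q + r) - falling(M, q) * falling(M, r)
    return Fraction(num, M ** (q + r))

def alt_sum(q):
    """Sum over a = 1..q of C(q, a) * a * (-1)^(q - a)."""
    return sum(comb(q, a) * a * (-1) ** (q - a) for a in range(1, q + 1))

for q, r in ((1, 1), (2, 1), (2, 3)):
    for M in (10**2, 10**4, 10**6):
        # M * bracket tends to -q * r as M grows.
        print(q, r, M, float(M * bracket(M, q, r)))

print([alt_sum(q) for q in range(1, 6)])  # nonzero only for q = 1
```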
Disjoint balls For $\{X, Y\} \in \Psi_k^c$, there is no closed form expression for the covariance.
However we have the following lemma by applying the Cauchy-Schwarz inequality:

Lemma 7. For a fixed pair of points $\{X, Y\} \in \Psi_k^c$,

\[ \mathrm{Cov}[\mathbf{e}_k^q(X), \mathbf{e}_k^r(Y)] = 1_{\{q=1, r=1\}}\, O\left(\frac{1}{k}\right) + o\left(\frac{1}{k}\right). \]

Proof.

\[ |\mathrm{Cov}[\mathbf{e}_k^q(X), \mathbf{e}_k^r(Y)]| \le \sqrt{V[\mathbf{e}_k^q(X)]\, V[\mathbf{e}_k^r(Y)]}
 = 1_{\{q=1, r=1\}}\, O\left(\frac{1}{k}\right) + o\left(\frac{1}{k}\right). \]

□



Joint expression 

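Lemma 8. Let $\gamma_1(x)$, $\gamma_2(x)$ be arbitrary functions with 1 partial derivative wrt $x$ and
$\sup_x |\gamma_1(x)| < \infty$, $\sup_x |\gamma_2(x)| < \infty$. Let $X_1, \ldots, X_M, X, Y$ denote $M + 2$ i.i.d realizations
of the density $f$. Then,

\[ \mathrm{Cov}[\gamma_1(X) \mathbf{e}_k^q(X), \gamma_2(Y) \mathbf{e}_k^r(Y)] = 1_{\{q=1, r=1\}}\, c_5(\gamma_1(x), \gamma_2(x)) \left(\frac{1}{M}\right) + o\left(\frac{1}{M}\right), \tag{A.29} \]

\[ \mathrm{Cov}[\gamma_1(X) \hat{\mathbf{e}}_k^q(X), \gamma_2(Y) \hat{\mathbf{e}}_k^r(Y)] = 1_{\{q=1, r=1\}}\, c_5(\gamma_1(x), \gamma_2(x)) \left(\frac{1}{M}\right) + o\left(\frac{1}{M}\right), \tag{A.30} \]

where $c_5(\gamma_1(x), \gamma_2(x))$ is a functional of $\gamma_1(x)$, $\gamma_2(x)$ and $f$.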

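Proof. Let the indicator function $1_{\Delta_k}(X, Y)$ denote the event $\Delta_k : \{X, Y\} \in \Psi_k^c$. Then

\[ \mathrm{Cov}[\gamma_1(X) \mathbf{e}_k^q(X), \gamma_2(Y) \mathbf{e}_k^r(Y)] = I + D, \]

where 'I' stands for the contribution from the intersecting balls and 'D' for the contribution
from the disjoint balls. $I$ and $D$ are given by

\[ I = E\big[1_{\Delta_k}(X, Y)\, \mathrm{Cov}[\gamma_1(X) \mathbf{e}_k^q(X), \gamma_2(Y) \mathbf{e}_k^r(Y)]\big], \]
\[ D = E\big[(1 - 1_{\Delta_k}(X, Y))\, \mathrm{Cov}[\gamma_1(X) \mathbf{e}_k^q(X), \gamma_2(Y) \mathbf{e}_k^r(Y)]\big]. \]

When $1_{\Delta_k}(X, Y) \ne 0$, we have $\{X, Y\} \in \Psi_k^c$. Then,

\[ I = E[1_{\Delta_k}(X, Y) \gamma_1(X) \gamma_2(Y) \mathbf{e}_k^q(X) \mathbf{e}_k^r(Y)] \]
\[ = E\big[1_{\Delta_k}(X, Y) \gamma_1(X) \gamma_2(Y)\, E_{X,Y}[\mathbf{e}_k^q(X) \mathbf{e}_k^r(Y)]\big] \]
\[ \le E\Big[1_{\Delta_k}(X, Y) \gamma_1(X) \gamma_2(Y) \sqrt{E_X[\mathbf{e}_k^{2q}(X)]\, E_Y[\mathbf{e}_k^{2r}(Y)]}\Big] \]
\[ = E\Big[1_{\Delta_k}(X, Y) \gamma_1(X) \gamma_2(Y) \Big(1_{\{q=1, r=1\}}\, O\Big(\frac{1}{k}\Big) + o\Big(\frac{1}{k}\Big)\Big)\Big] \]
\[ = \int \Big(1_{\{q=1, r=1\}}\, O\Big(\frac{1}{k}\Big) + o\Big(\frac{1}{k}\Big)\Big) (\gamma_1(x) \gamma_2(x) + o(1)) \Big(\int \Delta_k(x, y)\,dy\Big) dx \]
\[ = \int \Big(1_{\{q=1, r=1\}}\, O\Big(\frac{1}{k}\Big) + o\Big(\frac{1}{k}\Big)\Big) (\gamma_1(x) \gamma_2(x) + o(1)) \Big(\frac{k}{M}\Big) dx \]
\[ = 1_{\{q=1, r=1\}}\, c_{5,1}(\gamma_1, \gamma_2) \left(\frac{1}{M}\right) + o\left(\frac{1}{M}\right), \]

where the bound is obtained using the Cauchy-Schwarz inequality and using Eq. (A.28). Also,

\[ D = E\big[(1 - 1_{\Delta_k}(X, Y)) \gamma_1(X) \gamma_2(Y)\, E_{X,Y}[\mathrm{Cov}(\mathbf{e}_k^q(X), \mathbf{e}_k^r(Y))]\big] \tag{A.31} \]
\[ = 1_{\{q=1, r=1\}}\, c_{5,2}(\gamma_1, \gamma_2) \left(\frac{1}{M}\right) + o\left(\frac{1}{M}\right). \]

This gives

\[ \mathrm{Cov}[\gamma_1(X) \mathbf{e}_k^q(X), \gamma_2(Y) \mathbf{e}_k^r(Y)] = 1_{\{q=1, r=1\}}\, c_5(\gamma_1(x), \gamma_2(x)) \left(\frac{1}{M}\right) + o\left(\frac{1}{M}\right). \]

Again, since $X \in S_I(k)$ implies $\hat{\mathbf{e}}_k(X) = \mathbf{e}_k(X)$ and $\Pr(X \notin S_I(k)) = o(1)$,

\[ \mathrm{Cov}[\gamma_1(X) \hat{\mathbf{e}}_k^q(X), \gamma_2(Y) \hat{\mathbf{e}}_k^r(Y)] = 1_{\{q=1, r=1\}}\, c_5(\gamma_1(x), \gamma_2(x)) \left(\frac{1}{M}\right) + o\left(\frac{1}{M}\right). \]

This concludes the proof.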
B Bias and variance results

Lemma 9. Assume that $U(x, y)$ is any arbitrary functional which satisfies

(i) $\sup_{y} |U(0, y)| = G_1 < \infty$,

(ii) $\sup_{x \in (p_l, p_u),\, y} |U(x, y)| = G_2/4 < \infty$,

(iii) $\sup_{x \in (1/k, p_u),\, y} |U(x, y)|\, \mathcal{C}(M) = G_3 < \infty$,

(iv) $E\big[\sup_{x \in (p_l, 2^d M/k),\, y} |U(x, y)|\big]\, \mathcal{C}(M) = G_4 < \infty$.

Let $Z$ denote $X_i$ for some fixed $i \in \{1, \ldots, N\}$. Let $\zeta_Z$ be any random variable which almost
surely lies in the range $(f(Z), \tilde{\mathbf{f}}_k(Z))$. Then,

\[ E[|U(\zeta_Z, Z)|] < \infty. \]

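Proof. We will show that the conditional expectation $E[|U(\zeta_Z, Z)| \mid X_N] < \infty$. Because
$0 < \epsilon_0 \le f(X) \le \epsilon_\infty < \infty$ by $(\mathcal{A}.1)$, it immediately follows that

\[ E[|U(\zeta_Z, Z)|] = E\big[E[|U(\zeta_Z, Z)| \mid X_N]\big] < \infty. \]

Also observe that $\epsilon_0 < f(Z) < \epsilon_\infty$ and therefore $p_l < f(Z) < p_u$. Finally observe that
the events $\natural_1(Z)$ and $\natural_2(Z)$ occur with probability $O(\mathcal{C}(M))$. Using (A.8), (A.9), (A.10),
conditioned on $X_N$,

\[ E[|U(\zeta_Z, Z)|] = E[1_{\natural_0(Z)} |U(\zeta_Z, Z)|] + E[1_{\natural_1(Z)} |U(\zeta_Z, Z)|] + E[1_{\natural_2(Z)} |U(\zeta_Z, Z)|] + E[1_{\natural(Z)} |U(\zeta_Z, Z)|] \]
\[ \le (G_1 + G_2) + (G_3 + G_2) + (G_4 + G_2) + (G_2) \]
\[ = G_1 + 4 G_2 + G_3 + G_4 < \infty. \tag{B.1} \]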
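Proof of Theorem 2

Proof. Using the continuity of $g'''(x, y)$, construct the following third order Taylor series of
$g(\tilde{\mathbf{f}}_k(Z), Z)$ around the conditional expected value $\tilde{f}_k(Z) = E[\tilde{\mathbf{f}}_k(Z) \mid Z]$.

\[ g(\tilde{\mathbf{f}}_k(Z), Z) = g(\tilde{f}_k(Z), Z) + g'(\tilde{f}_k(Z), Z)\, \mathbf{e}_k(Z)
 + \frac{1}{2} g''(\tilde{f}_k(Z), Z)\, \mathbf{e}_k^2(Z) + \frac{1}{6} g^{(3)}(\zeta_Z, Z)\, \mathbf{e}_k^3(Z), \]

where $\zeta_Z \in (\tilde{f}_k(Z), \tilde{\mathbf{f}}_k(Z))$ is defined by the mean value theorem. This gives

\[ E[g(\tilde{\mathbf{f}}_k(Z), Z) - g(\tilde{f}_k(Z), Z)]
 = E\Big[\frac{1}{2} g''(\tilde{f}_k(Z), Z)\, \mathbf{e}_k^2(Z)\Big] + E\Big[\frac{1}{6} g^{(3)}(\zeta_Z, Z)\, \mathbf{e}_k^3(Z)\Big]. \]

Let $\Delta(Z) = \frac{1}{6} g^{(3)}(\zeta_Z, Z)$. Direct application of Lemma 9 in conjunction with assumption
$(\mathcal{A}.5)$ implies that $E[\Delta^2(Z)] = O(1)$. By Cauchy-Schwarz and applying Lemma 5 for the
choice $q = 6$,

\[ \big|E[\Delta(Z)\, \mathbf{e}_k^3(Z)]\big| \le \sqrt{E[\Delta^2(Z)]\, E[\mathbf{e}_k^6(Z)]} = o\left(\frac{1}{k}\right). \]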
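By observing that the density estimates $\{\tilde{\mathbf{f}}_k(X_i)\}$, $i = 1, \ldots, N$ are identically distributed, we therefore
have

\[ E[\tilde{\mathbf{G}}_k] - G(f) = E[g(\tilde{\mathbf{f}}_k(Z), Z) - g(f(Z), Z)] \]
\[ = E[g(\tilde{f}_k(Z), Z) - g(f(Z), Z)] + E\Big[\frac{1}{2} g''(\tilde{f}_k(Z), Z)\, \mathbf{e}_k^2(Z)\Big] + o(1/k). \]

By Lemma 4 and Lemma 5 for the choice $q = 2$, in conjunction with assumptions $(\mathcal{A}.3)$ and
$(\mathcal{A}.4)$, this implies that

\[ E[\tilde{\mathbf{G}}_k] - G(f)
 = \sum_{i=1}^{d} c_{1,i}(g(x, y)) \left(\frac{k}{M}\right)^{i/d} + c_2(g''(\tilde{f}_k(x), x)) \left(\frac{1}{k}\right) + o\left(\frac{1}{k} + \frac{k}{M}\right) \]
\[ = \sum_{i=1}^{d} c_{1,i}(g(x, y)) \left(\frac{k}{M}\right)^{i/d} + c_2(g''(f(x), x)) \left(\frac{1}{k}\right) + o\left(\frac{1}{k} + \frac{k}{M}\right) \]
\[ = \sum_{i=1}^{d} c_{1,i} \left(\frac{k}{M}\right)^{i/d} + \frac{c_2}{k} + o\left(\frac{1}{k} + \frac{k}{M}\right), \]

where the last but one step follows because, by (A.3), we know $\tilde{f}_k(Z) = f(Z) + o(1)$. This
in turn implies $c_2(g''(\tilde{f}_k(x), x)) = c_2(g''(f(x), x)) + o(1)$. Finally, by assumptions
$(\mathcal{A}.2)$ and $(\mathcal{A}.4)$, the leading constants $c_{1,i}$ and $c_2$ are bounded.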

Note that the natural density estimate $\hat{\mathbf{f}}_k(X)$ is identical to the truncated kernel density
estimate $\tilde{\mathbf{f}}_k(X)$ on the set $X \in S_I(k)$. From the definition of the set $S_I(k)$, $\Pr(Z \notin S_I(k)) = O((k/M)^{1/d}) = o(1)$.

\[ E[\hat{\mathbf{G}}_k] - G(f) = E[g(\hat{\mathbf{f}}_k(Z), Z) - g(f(Z), Z)] \]
\[ = E\big[1_{\{Z \in S_I(k)\}} (g(\hat{\mathbf{f}}_k(Z), Z) - g(f(Z), Z))\big] + E\big[1_{\{Z \in S - S_I(k)\}} (g(\hat{\mathbf{f}}_k(Z), Z) - g(f(Z), Z))\big] \]
\[ = I + II. \tag{B.2} \]

Using the exact same method as in the proof of Theorem 2, using (A.3) and (A.25), and
the fact that $\Pr(Z \notin S_I(k)) = O((k/M)^{1/d}) = o(1)$, we have



I = c^,Mx, y)) (^) '^' + c2(/(/(x))) (0 + 4 ^ + ( 



M ) 



2/d^ 



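Because we assume that $g$ satisfies assumption $(\mathcal{A}.5)$, from the proof of Lemma 9 for
$Z \in S - S_I(k)$, we have $E[g(\hat{\mathbf{f}}_k(Z), Z) - g(f(Z), Z)] = O(1)$. This implies that

\[ II = E\big[1_{\{Z \in S - S_I(k)\}} (g(\hat{\mathbf{f}}_k(Z), Z) - g(f(Z), Z))\big] \]
\[ = E\big[g(\hat{\mathbf{f}}_k(Z), Z) - g(f(Z), Z) \mid Z \in S - S_I(k)\big] \times \Pr(Z \notin S_I(k)) \]
\[ = O(1) \times O((k/M)^{1/d}) = O((k/M)^{1/d}). \tag{B.3} \]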
This implies that 

E[Gfc]-G(/) = 1 + 11 

u 

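Proof of Theorem 3

Proof. By the continuity of $g^{(\lambda)}(x, y)$, we can construct the following Taylor series of $g(\tilde{\mathbf{f}}_k(Z), Z)$
around the conditional expected value $\tilde{f}_k(Z)$.

\[ g(\tilde{\mathbf{f}}_k(Z), Z) = g(\tilde{f}_k(Z), Z) + g'(\tilde{f}_k(Z), Z)\, \mathbf{e}_k(Z)
 + \sum_{i=2}^{\lambda - 1} \frac{g^{(i)}(\tilde{f}_k(Z), Z)}{i!}\, \mathbf{e}_k^i(Z) + \frac{g^{(\lambda)}(\xi_Z, Z)}{\lambda!}\, \mathbf{e}_k^{\lambda}(Z), \]

where $\xi_Z \in (g(E_Z[\tilde{\mathbf{f}}_k(Z)]), g(\tilde{\mathbf{f}}_k(Z)))$. Denote $g^{(\lambda)}(\xi_Z, Z)/\lambda!$ by $\Psi(Z)$. Further define the
operator $\mathcal{M}(Z) = Z - E[Z]$ and

\[ p_i = \mathcal{M}(g(\tilde{f}_k(X_i), X_i)), \]
\[ q_i = \mathcal{M}(g'(\tilde{f}_k(X_i), X_i)\, \mathbf{e}_k(X_i)), \]
\[ r_i = \mathcal{M}\Big(\sum_{j=2}^{\lambda - 1} \frac{g^{(j)}(\tilde{f}_k(X_i), X_i)}{j!}\, \mathbf{e}_k^j(X_i)\Big), \]
\[ s_i = \mathcal{M}(\Psi(X_i)\, \mathbf{e}_k^{\lambda}(X_i)). \]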
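The variance of the estimator $\tilde{\mathbf{G}}_k$ is given by

\[ V[\tilde{\mathbf{G}}_k] = E[(\tilde{\mathbf{G}}_k - E[\tilde{\mathbf{G}}_k])^2] \]
\[ = \frac{1}{N} E[(p_1 + q_1 + r_1 + s_1)^2] + \frac{N - 1}{N} E[(p_1 + q_1 + r_1 + s_1)(p_2 + q_2 + r_2 + s_2)]. \]

Because $X_1$, $X_2$ are independent, we have $E[(p_1)(p_2 + q_2 + r_2 + s_2)] = 0$. Furthermore,

\[ E[(p_1 + q_1 + r_1 + s_1)^2] = E[p_1^2] + o(1) = V[g(\tilde{f}_k(Z), Z)] + o(1). \]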



Applying Lemma 5 and Lemma 8 in conjunction with assumptions $(\mathcal{A}.3)$ and $(\mathcal{A}.4)$, it follows
that

• $E[p_1^2] = V[g(\tilde{f}_k(Z), Z)] = c_4(g(\tilde{f}_k(x), x))$

• $E[q_1 q_2] = c_5(g'(\tilde{f}_k(x), x), g'(\tilde{f}_k(x), x)) \left(\frac{1}{M}\right) + o\left(\frac{1}{M}\right)$

• $E[q_1 r_2] = o\left(\frac{1}{M}\right)$

• $E[r_1 r_2] = o\left(\frac{1}{M}\right)$

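Since $q_1$ and $s_2$ are zero mean random variables,

\[ E[q_1 s_2] = E\big[q_1 \Psi(X_2) (\tilde{\mathbf{f}}_k(X_2) - \tilde{f}_k(X_2))^{\lambda}\big]
 = E\big[q_1 \Psi(X_2)\, \mathbf{e}_k^{\lambda}(X_2)\big] \]
\[ \le \sqrt{E[\Psi^2(X_2)]\, E[q_1^2\, \mathbf{e}_k^{2\lambda}(X_2)]}
 = \sqrt{E[\Psi^2(Z)]}\; o\left(\frac{1}{k^{\lambda/2}}\right). \]

Direct application of Lemma 9 in conjunction with assumption $(\mathcal{A}.5)$ implies that $E[\Psi^2(Z)] = O(1)$.
Note that from assumption $(\mathcal{A}.3)$, $o\left(\frac{1}{k^{\lambda/2}}\right) = o(1/M)$. In a similar manner, it can be
shown that $E[r_1 s_2] = o\left(\frac{1}{M}\right)$ and $E[s_1 s_2] = o\left(\frac{1}{M}\right)$. This implies that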



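\[ V[\tilde{\mathbf{G}}_k] = \frac{1}{N} E[p_1^2] + \frac{N - 1}{N} E[q_1 q_2] + o\left(\frac{1}{M} + \frac{1}{N}\right) \]
\[ = c_4(g(\tilde{f}_k(x), x)) \left(\frac{1}{N}\right) + c_5(g'(\tilde{f}_k(x), x), g'(\tilde{f}_k(x), x)) \left(\frac{1}{M}\right) + o\left(\frac{1}{M} + \frac{1}{N}\right) \]
\[ = c_4(g(f(x), x)) \left(\frac{1}{N}\right) + c_5(g'(f(x), x), g'(f(x), x)) \left(\frac{1}{M}\right) + o\left(\frac{1}{M} + \frac{1}{N}\right), \]

where the last but one step follows because, by (A.3), we know $\tilde{f}_k(Z) = f(Z) + o(1)$. This
in turn implies $c_4(g(\tilde{f}_k(x), x)) = c_4(g(f(x), x)) + o(1)$ and $c_5(g'(\tilde{f}_k(x), x), g'(\tilde{f}_k(x), x)) =
c_5(g'(f(x), x), g'(f(x), x)) + o(1)$. Finally, by assumptions $(\mathcal{A}.2)$ and $(\mathcal{A}.4)$, the leading
constants $c_4$ and $c_5$ are bounded.

Because of the identical nature of the expressions of $\mathbf{e}_k(X)$ and $\hat{\mathbf{e}}_k(X)$ in Lemma 5 and
Lemma 8, it immediately follows that

\[ V[\hat{\mathbf{G}}_k] = c_4 \left(\frac{1}{N}\right) + c_5 \left(\frac{1}{M}\right) + o\left(\frac{1}{M} + \frac{1}{N}\right). \]

This concludes the proof of Theorem 3.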
Figure 4: Illustration for the proof of Lemma 4.